# checking the Python version
import sys
sys.version
sys.version_info
Basic questions about dataset¶
How many data points (people) are in the dataset?
from explore_enron_data import enron_data
total = len(enron_data)
For each person, how many features are available?
This should be the length of the value dict for each key. Assuming every person has the same number of features, let us check the first person.
len(list(enron_data.values())[0])  # feature dict of the first person
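The assumption that every person carries the same number of features can be checked directly. A minimal sketch (written for Python 3, with a made-up toy dict standing in for `enron_data`):

```python
# Toy stand-in for enron_data: {person: {feature: value}}; names/values invented
sample_data = {
    'PERSON A': {'salary': 1000, 'poi': False, 'email_address': 'a@enron.com'},
    'PERSON B': {'salary': 'NaN', 'poi': True, 'email_address': 'b@enron.com'},
}

# Collect the distinct feature counts across all people; a single element
# in the set means every person has the same number of features
feature_counts = {len(features) for features in sample_data.values()}
assert len(feature_counts) == 1, "people have differing numbers of features"
n_features = feature_counts.pop()
```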
How many POIs are there?
In other words, count the number of entries in the dictionary where data[person_name]["poi"]==1
count = 0
for k, v in enron_data.iteritems():
    if v['poi'] == 1:
        count += 1
count
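The same count can also be written as a one-liner with `sum()` over a generator (Python 3 `dict.values`; the toy dict below is invented):

```python
# Invented toy data: person -> {'poi': bool}
sample_data = {
    'PERSON A': {'poi': False},
    'PERSON B': {'poi': True},
    'PERSON C': {'poi': True},
}

# Each person whose 'poi' flag is truthy contributes 1 to the sum
poi_count = sum(1 for v in sample_data.values() if v['poi'])
```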
Query further the Dataset¶
Total value of stock belonging to James Prentice?
enron_data['PRENTICE JAMES']['total_stock_value']
How many email messages do we have from Wesley Colwell to persons of interest?
enron_data['COLWELL WESLEY']['from_this_person_to_poi']
What's the value of stock options exercised by Jeffrey K Skilling?
enron_data['SKILLING JEFFREY K']['exercised_stock_options']
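A note on the lookup pattern: `enron_data.get(name, [])[feature]` raises a `TypeError` when the person is missing, because a list cannot be indexed by a string. If a graceful miss is wanted, defaulting to an empty dict lets the inner `.get` return a sentinel instead. A hypothetical helper (`lookup` is not part of the course code, and the toy data is invented):

```python
# Toy data; the person and value are invented
sample_data = {
    'PRESENT PERSON': {'total_stock_value': 12345},
}

def lookup(data, person, feature, default=None):
    """Return data[person][feature], or `default` if either key is missing."""
    return data.get(person, {}).get(feature, default)

present = lookup(sample_data, 'PRESENT PERSON', 'total_stock_value')
missing = lookup(sample_data, 'ABSENT PERSON', 'total_stock_value')
```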
Enron CEO during fraud: Jeffrey Skilling
Enron Chairman during fraud: Kenneth Lay
Enron CFO during fraud: Andrew Fastow
Follow the Money¶
Of these three individuals (Lay, Skilling and Fastow), who took home the most money (largest value of 'total_payments' feature)?
from operator import itemgetter
total_payments_dict = {}  # maps person -> total_payments
poi_list = ['SKILLING JEFFREY K', 'LAY KENNETH L', 'FASTOW ANDREW S']
for each_person in poi_list:
    total_payments_dict[each_person] = enron_data[each_person]['total_payments']
max(total_payments_dict.iteritems(), key=itemgetter(1)) #ref: https://artemrudenko.wordpress.com/2013/04/12/python-finding-a-key-of-dictionary-element-with-the-highestmin-value/
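The `max(..., key=itemgetter(1))` idiom compares the `(key, value)` pairs by their second element, so it returns the pair with the largest value. A Python 3 version (`dict.items` instead of `iteritems`), using placeholder payment figures rather than the real dataset values:

```python
from operator import itemgetter

# Placeholder amounts for illustration only
payments = {
    'SKILLING JEFFREY K': 8_000_000,
    'LAY KENNETH L': 100_000_000,
    'FASTOW ANDREW S': 2_000_000,
}

# Compare (person, amount) pairs by the amount (index 1)
top_person, top_amount = max(payments.items(), key=itemgetter(1))
```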
Unfilled Features¶
How is it denoted when a feature doesn't have a well-defined value?
# testing
enron_data['SKILLING JEFFREY K']
So the answer is the string 'NaN'.
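Worth noting: the missing-value marker here is the literal *string* 'NaN', not the float NaN. That is why the `!= 'NaN'` comparisons below work at all; a float NaN compares unequal to everything, including itself. A quick check:

```python
# The dataset stores the *string* 'NaN' for missing values
missing_marker = 'NaN'
float_nan = float('nan')

string_equal = (missing_marker == 'NaN')  # plain string comparison -> True
float_equal = (float_nan == float_nan)    # IEEE NaN never equals itself -> False
```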
Dealing with unfilled features¶
How many folks in this dataset have a quantified salary?
people_counter = 0  # count only those having a quantified salary
for k, v in enron_data.iteritems():
    salary = v['salary']
    #print salary
    if salary != 'NaN':
        people_counter += 1
people_counter
How many folks in this dataset have known email address?
email_counter = 0  # count only those having a known email address
for k, v in enron_data.iteritems():
    email = v['email_address']
    #print email
    if email != 'NaN':
        #if '..' not in email:  # apparently this is not a problem..
        email_counter += 1
email_counter
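The two loops above share the same shape and could be folded into one helper that counts filled values for any feature. A Python 3 sketch with an invented toy dict; `count_filled` is a hypothetical name, not course code:

```python
# Invented toy data mixing filled and 'NaN' values
sample_data = {
    'PERSON A': {'salary': 1000, 'email_address': 'a@enron.com'},
    'PERSON B': {'salary': 'NaN', 'email_address': 'b@enron.com'},
    'PERSON C': {'salary': 2000, 'email_address': 'NaN'},
}

def count_filled(data, feature):
    """Count people whose `feature` is not the 'NaN' missing marker."""
    return sum(1 for v in data.values() if v[feature] != 'NaN')

salary_count = count_filled(sample_data, 'salary')
email_count = count_filled(sample_data, 'email_address')
```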
Missing POIs 1¶
How many people in the E+F dataset (as it currently exists) have 'NaN' for their total payments? What percentage of people in the dataset as a whole is this?
from __future__ import division # for python 2, sigh..
people_counter = 0  # count only those having unquantified payments, i.e. 'NaN'
for k, v in enron_data.iteritems():
    total_payments = v['total_payments']
    if total_payments == 'NaN':
        people_counter += 1
print people_counter
print people_counter / total
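The percentage step is simply the NaN count over the dataset size; under Python 2 the `from __future__ import division` import is what makes `/` do true division. A sketch with placeholder counts, not the real figures:

```python
# Placeholder counts for illustration only
nan_count = 20
total_people = 100

fraction = nan_count / total_people  # true division -> 0.2
percentage = 100 * fraction          # -> 20.0
```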
Missing POIs 2¶
How many POIs in the E+F dataset have 'NaN' for their total payments? What percentage of POIs as a whole is this?
from __future__ import division # for python 2, sigh..
poi_nan_counter = 0
poi_total_counter = 0
for k, v in enron_data.iteritems():
    total_payments = v['total_payments']
    poi = v['poi']
    if poi == True:
        poi_total_counter += 1
        if total_payments == 'NaN':
            poi_nan_counter += 1
print poi_nan_counter
print poi_nan_counter / poi_total_counter
Yes, I double-checked: it's 0.
Missing POIs 3¶
If a machine learning algorithm were to use total_payments as a feature, would you expect it to associate a "NaN" value with POIs or non-POIs?
With non-POIs, because in our training dataset only non-POIs have 'NaN' values, so a 'NaN' in total_payments would act as a signal pointing toward the non-POI class.
On the other hand, every POI has quantified payments (none have 'NaN'), so checking for 'NaN' gives the algorithm nothing with which to distinguish POIs.
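This asymmetry can be made concrete by computing the 'NaN' rate separately per class. On a toy dataset (invented below) where only non-POIs have missing payments, the two rates differ sharply, which is exactly the signal a classifier could latch onto:

```python
# Invented toy data: only non-POIs have missing total_payments
sample_data = {
    'POI A':    {'poi': True,  'total_payments': 100},
    'POI B':    {'poi': True,  'total_payments': 200},
    'NONPOI C': {'poi': False, 'total_payments': 'NaN'},
    'NONPOI D': {'poi': False, 'total_payments': 300},
}

def nan_rate(data, is_poi):
    """Fraction of the given class whose total_payments is the 'NaN' marker."""
    group = [v for v in data.values() if v['poi'] == is_poi]
    missing = sum(1 for v in group if v['total_payments'] == 'NaN')
    return missing / len(group)

poi_rate = nan_rate(sample_data, True)      # 0.0: no POI has missing payments
nonpoi_rate = nan_rate(sample_data, False)  # 0.5: half the non-POIs do
```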
Missing POIs 4¶
If you added in, say, 10 more data points which were all POI's, and put 'NaN' for the total payments for those folks, the numbers you just calculated would change.
What is the new number of people in the dataset? What is the new number of folks with 'NaN' for total payments?
# current no of people in dataset
people_counter = len(enron_data)
# current no of folks with 'NaN' for total payments
nan_counter = 0
for k, v in enron_data.iteritems():
    if v['total_payments'] == 'NaN':
        nan_counter += 1
# 10 new POIs added, so
people_counter = people_counter + 10
print 'new total: ' + str(people_counter)
# 10 new NaNs
nan_counter = nan_counter + 10
print 'new nans: ' + str(nan_counter)
Missing POIs 5¶
What is the new number of POIs in the dataset? What is the new number of POIs with NaN for total_payments?
# current no of POIs
poi_counter = 0
for k, v in enron_data.iteritems():
    if v['poi'] == True:
        poi_counter += 1
# after new 10 pois
poi_counter = poi_counter + 10
print poi_counter
# since all earlier POIs had quantified total_payments, the new number of POIs with 'NaN' is 10
Missing POIs 6¶
Once the new data points are added, do you think a supervised classification algorithm might interpret 'NaN' for total_payments as a clue that someone is a POI?
Ans: Yes. Now some POIs have quantified payments and some have 'NaN', so a 'NaN' total_payments no longer rules out being a POI.